PART I


Web Scraping Basics


CHAPTER 1


Introduction 



1.1 What Is Web Scraping? 





which in the early days of computing (think 1960s-80s) often boiled down to simple.

















REPL is waiting for your commands. Try executing the following commands:









>= 'NoneType' and 'int'















>>> li = ['a', 2, False]  # not all elements need to be the same type
>>> li = [[3], [3, 4], [1, 2, 3]]  # even lists of lists
>>> li = [1, 2, 4, 3]
>>> li[0]
1
>>> li[-1]
3
>>> li[1:3]
[2, 4]
>>> li[2:]
[4, 3]
>>> li[:3]
[1, 2, 4]
>>> li[::2]  # general format is li[start:end:step]
[1, 4]
>>> li[::-1]
[3, 4, 2, 1]
>>> del li[2]  # li is now [1, 2, 3]
>>> li.remove(2)  # li is now [1, 3]
>>> li.insert(1, 1000)  # li is now [1, 1000, 3]
>>> [1, 2, 3] + [10, 20]
[1, 2, 3, 10, 20]
>>> li = [1, 2, 3]
>>> li.extend([1, 2, 3])
>>> li
[1, 2, 3, 1, 2, 3]
>>> li.index(2)
1
>>> li.index(200)
ValueError: 200 is not in list









>>> type((1))  # a tuple of length one has to have a comma after its element
>>> type((1,))
>>> type(())
>>> len(tup)














>>> list(filled_dict.keys())
>>> list(filled_dict.values())
[1, 2, 3]
>>> 'one' in filled_dict  # in checks based on keys
>>> filled_dict.get('one')
>>> filled_dict.get("four")
>>> filled_dict.get("four", 4)  # default value if not found
>>> filled_dict["four"] = 4  # also possible to add/update this way
>>> del filled_dict["one"]  # removes the key 'one'











Indentation Some programmers find this whitespace indentation frustrating
when first working with Python, though it does undeniably lead to more readable



Bigger than 5 


Readable If Blocks Remember that zero (0) integers, floats, and complex
numbers all evaluate to False in Python. Similarly, empty strings, sets, tuples, lists,
and dictionaries also evaluate to False, so instead of writing if len(my_list) > 0:,
you can simply use if my_list: as well, which is much easier to read.
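
The truthiness rule described in this note can be checked with a small sketch. The function name and the sample values below are made up for illustration only:

```python
def describe(items):
    """Return a label based on the truthiness of `items`."""
    # For a list, `if items:` behaves like `if len(items) > 0:`.
    if items:
        return 'not empty'
    return 'empty'

my_list = []
print(describe(my_list))      # an empty list is falsy
print(describe([1, 2, 3]))    # a non-empty list is truthy

# Zero-valued numbers and empty containers are all falsy.
print([bool(v) for v in (0, 0.0, 0j, '', (), [], {}, set())])
```

Note that a list holding a single falsy element, such as [0], is still truthy, because only its length matters.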












helpful built-in










>>> range(3)
range(0, 3)


>>> list(range(3))


) csplidliy c 





Is your "for" loop going in circles? You can press Control+C on your keyboard to
execute a "Keyboard Interrupt" and stop the execution of the current code block.













>>> def both_together(*args, **kwargs):
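
The body of both_together is not shown here. A minimal sketch, assuming the function simply hands back what it received, illustrates how *args and **kwargs collect arguments:

```python
def both_together(*args, **kwargs):
    # args is a tuple holding the positional arguments;
    # kwargs is a dict holding the keyword arguments.
    # Returning them is an assumption made for this sketch.
    return args, kwargs

positional, named = both_together(1, 2, a=3, b=4)
print(positional)  # the positional arguments, as a tuple
print(named)       # the keyword arguments, as a dict
```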







CHAPTER 2 


The Web Speaks HTTP 



2.1 The Magic of Networking 












Each line in an HTTP message must end with <CR><LF> (the ASCII characters 0D
and 0A).
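
To make the line-ending rule concrete, the following sketch assembles a minimal request message by hand. The host name and the helper function are hypothetical, and no network connection is made:

```python
CRLF = '\r\n'  # ASCII 0D followed by ASCII 0A

def build_get_request(host, path='/'):
    """Assemble a minimal HTTP/1.1 GET request message as bytes."""
    lines = [
        'GET {} HTTP/1.1'.format(path),
        'Host: {}'.format(host),
        'Connection: close',
    ]
    # Every line ends with CRLF; one extra blank CRLF line ends the headers.
    message = CRLF.join(lines) + CRLF + CRLF
    return message.encode('ascii')

request = build_get_request('example.com')
print(request)
```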























Finally, our request message ends with a blank <CR><LF> line, and has no message






Content-Encoding: gzip 








2.3 HTTP in Python: The Requests Library 















r = requests.get(url)
print(r.text)


wanted to make sure that the examples we provide continue working as long as
possible. Don't be too upset about staying in the "safe playground" for now, as
various real-life examples are included in the last chapter as well.



























2.4 Query Strings: URLs with Parameters 


http://www.webscrapingfordata













dcalvvitbt 







r = requests.get(url)








not part of the URL parameter value.
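
The role of escaping in query strings can be sketched with the standard library alone. The base URL and parameters below are made up for illustration:

```python
from urllib.parse import urlencode, quote, parse_qs, urlsplit

# Hypothetical base URL and parameters, used only for illustration.
base_url = 'http://example.com/search'
params = {'q': 'a & b', 'page': '2'}

# urlencode escapes characters such as "&" inside values, so they are
# not mistaken for the separators between parameters.
url = base_url + '?' + urlencode(params)
print(url)

# Decoding the query string recovers the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded)

# quote() escapes a single value; here "&" becomes "%26".
print(quote('a & b'))
```

Libraries such as requests perform this kind of escaping for you when parameters are passed separately from the URL.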




















alchttp/. Play around with the "a," "b," and "op" URL parameters. You should be



print(calc(4, 6, '*'))
print(calc(4, 6, '/'))


'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
print(r.text)















CHAPTER 3 


Stirring the HTML and 
CSS Soup 


Cascading Style Sheets (CSS). This chapter then discusses
which will help us to make sense of the HTML and CSS



3.1 Hypertext Markup Language: HTML 








































































3.3 Cascading Style Sheets: CSS 


Before we can get started with actually dealing with HTML in Python,



















3.4 The Beautiful Soup Library 

We're now ready to start working with HTML pages using Python. Recall the following










r = requests.get(url)

html_soup = BeautifulSoup(html_contents)




To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})











Underscores If you don't like writing underscores, Beautiful Soup also exposes
most of its methods using "camelCaps" capitalization. So instead of find_all,
you can also use findAll if you prefer.




print(html_soup.find('h1'))
print(html_soup.find('', {'id': 'p-logo'}))

h1', 'h2']):















Take Care with Keywords Even though the **keywords argument can
come in very helpful in practice, there are some important caveats to mention
here. First of all, you cannot use class as a keyword, as this is a reserved
Python keyword. This is a pity, as this will be one of the most frequently used
attributes when hunting for content inside HTML. Luckily, Beautiful Soup has
provided a workaround. Instead of using class, just write class_ as follows:
"find(class_='myclass')". Note that name also cannot be used as a
keyword, since that is what is used already as the first argument name for find
and find_all. Sadly, Beautiful Soup does not provide a name_ alternative here.
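
The reason for the class_ spelling can be seen without Beautiful Soup at all. In the sketch below, find is a hypothetical stand-in written for this example; it is not Beautiful Soup's implementation and only mimics how a trailing-underscore keyword could be mapped back to an HTML attribute:

```python
import keyword

# "class" is reserved by Python itself, so it can never be used as a
# keyword-argument name; "class_" is an ordinary identifier.
print(keyword.iskeyword('class'))
print(keyword.iskeyword('class_'))

def find(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in that merges keyword filters into attrs."""
    filters = dict(attrs or {})
    for key, value in kwargs.items():
        # Map the trailing-underscore spelling back to the HTML attribute.
        filters['class' if key == 'class_' else key] = value
    return name, filters

print(find('div', class_='myclass'))
```

Because name is already the first parameter of the function, passing name='x' as an attribute filter would be read as the tag name instead, which is the second caveat mentioned above.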


















Similarly, the following line of code:
tag.find_all('h1')


tag('h1')















3.5 More on Beautiful Soup 




html_soup.find(['h1', 'h2'])



































PART II 


Advanced Web Scraping 


CHAPTER 4 


Delving Deeper in HTTP 



4.1 Working with Forms and POST Requests 


en fields, which will not










. 

























































<input type="hidden" name="protection" value="2c17abf5d5b4e326bea802600ff88405">






protection': '2c17abf5d5b4e326b


print(r.text) 


2600ff88405' 






html_soup = BeautifulSoup(r.text
p_val = html_soup.find('input',














following in Python:









the information contained in the form before embedding it in the HTTP POST request















m HTTP POST reque 







ndary=BOUNDARY












4.2 Other HTTP Request Methods 












4.3 More on Headers 













r = requests.get(url)









r = requests.get(url)
print(r.text) 



















print(r.text) 

print(r.headers) 




























Manual Deletion Note that setting an "Expires" or "Max-Age" attribute should
not be regarded as being a strict instruction. Users are free to delete cookies
manually, for instance, or might simply switch to another browser or device as well.
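
To see what such attributes look like when read programmatically, here is a minimal sketch using Python's standard http.cookies module. The header value is made up for illustration, and this is not how requests handles cookies internally:

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, for illustration only.
header_value = 'theme=dark; Max-Age=3600; Path=/'

cookie = SimpleCookie()
cookie.load(header_value)

morsel = cookie['theme']
print(morsel.value)         # the cookie value
print(morsel['max-age'])    # requested lifetime in seconds, as a string
print(morsel['path'])       # the path the cookie applies to
```

The Max-Age value is only a request made by the server; as the note says, nothing forces a client to keep the cookie that long.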


















my_cookies = {'PHPSESSID': 'ijfatbjege4Slnsfn2b5ci?706'}



PHPSESSID We use the PHP scripting language to power our examples,
so that the cookie name to identify a user's session is named "PHPSESSID".
Other websites might use "session," "SESSION_ID," "session_id," or any other
name as well. Do note, however, that the value representing a session should
be constructed randomly in a hard-to-guess manner. Simply setting a cookie






'1234'J> 


my_cookies['PHPSESSID'] = r.cookies.get('PHPSESSID')

























visited the login page, and are hence not











ckylogin/' 



rickylogin/' 








r = my_session.get(url, params={'p': 'protected'})
print(r.request.headers)




Clearing Cookies If you ever need to "clean" a session by clearing its


cookies in requests) behave like normal Python dictionaries.


4.6 Binary, JSON, and Other Forms of Content 










h a "UnicodeEncodeError." This is not




r = requests.get(url)


Don't Print It's not a good idea to print out the r.content attribute, as the
large amount of text may easily crash your Python console window.


using this method, Python will











r.raw provides a file-like object
is not often used directly and is









Working with JSON-formatted replies






print(r.json())

print(r.json().get('results'))

POST data as plain JSON. Using requests' data argument will not work in this case.
Instead, we need to use the json argument, which will basically instruct requests to






print(r.json())
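
The difference between a form-encoded body and a JSON body can be shown without sending anything. The payload below is hypothetical, and the sketch only builds the two body strings with the standard library:

```python
import json
from urllib.parse import urlencode

# Hypothetical payload, for illustration only.
payload = {'user': 'alice', 'limit': 10}

# What a classic form submission would place in the request body:
form_body = urlencode(payload)
# What a JSON submission would place in the request body:
json_body = json.dumps(payload)

print('form-encoded:', form_body)
print('JSON:        ', json_body)

# Parsing the JSON body gives back the original structure and types.
print(json.loads(json_body))
```

Note how the form-encoded variant flattens everything to text, whereas the JSON variant keeps 10 as a number.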




Internal APIs Even if the website you wish to scrape does not provide an

networking information to see if you can spot JavaScript-driven requests to URL
endpoints that return nicely structured JSON data. Even though an API might
not be documented, fetching the information directly from such "internal APIs" is
always a good idea, as this will avoid having to deal with the HTML soup.





CHAPTER 5 


Dealing with JavaScript 



5.1 What Is JavaScript? 








r = requests.get(url)








print(script_tag)








print(r.json()) 



"\x6A\x6F\x69\x6E","\x25","\x73[...]\x6C"];


var _0xe9a7=["\x63\x6F\x6F\x6B\x69\x65","\x6E\x6F\x6E\x63\x65\x302
document[_0xe9a7[0]]= _0xe9a7[1]

}$(function(){sc();




return decodeURIComponent([...],

function(_0x593ex4){

return _0x110b[2]+ [...]}

$(_0x110b[16])[_0x110b[15]]({

nextSelector:_0x110b[9],

$(_0x110b[14])[_0x110b[13]](function(_0x593ex5){

$(this)[_0x110b[10]](_0x593ex2($(this)[_0x110b[10]]()));

$(this)[_0x110b[11]](_0x110b[12])})})







Untangling JavaScript Security researchers will oftentimes try to untangle
obfuscated JavaScript code such as the one seen here to figure out what a
particular piece of code is doing. Obviously, this is a daunting and exhausting task.
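
One small part of that untangling is mechanical: the \xNN sequences in the obfuscated code are just hexadecimal character codes. A minimal sketch (the helper name is made up for this example) that decodes two of the strings visible in the fragment:

```python
import re

def decode_js_hex(escaped):
    """Replace every \\xNN escape with the character it stands for."""
    return re.sub(
        r'\\x([0-9A-Fa-f]{2})',
        lambda match: chr(int(match.group(1), 16)),
        escaped,
    )

# Raw strings keep the backslashes exactly as they appear in the
# JavaScript source.
print(decode_js_hex(r'\x63\x6F\x6F\x6B\x69\x65'))
print(decode_js_hex(r'\x6A\x6F\x69\x6E'))
```

Decoding the string table this way only reveals the names being used; following the control flow of the script remains the hard part.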








» 


PHPSESSID







'PHPSESSID': 1 !tc413«i3bgnijo2f qmioogdnv24' 







to check visitors to see whether they're using a real browser before showing

Check." In these cases, it will also be extremely hard to reverse engineer its
workings to pretend you're a browser when using requests.


5.3 Scraping with Selenium 




















XPath rules are:


nplc 











scrapers, remember that you can right-click HTML elements in the Elements tab of
Chrome's Developer Tools and select "Copy, Copy selector" and "Copy XPath" to
give you an outline of what an appropriate selector or expression might look like.











driver.get(url) 


input('Press ENTER to close the automated browser')
driver.quit() 































driver.get(url) 

quote_elements = WebDriverWait(driver, 10).until(
(By.CSS_SELECTOR, ".quote:not(.decode)")









JavaScript as well, we can use the execute_script method in order to send a JavaScript






'arguments[0].scrollTop = arguments[0].scrollHeight',

















driver.get(url)









# Click it to give it focus
actionchain.click()



















# Or: driver.find_element_by_css_selector('input[type="submit"]').click()
input('Press ENTER to close the automated browser')
driver.quit()







dick(iight- 








input('Press ENTER to close the automated browser')
driver.quit()




CHAPTER 6 


From Web Scraping 
to Web Crawling 

































from urllib.parse import urljoin, urldefrag















full_url = urldefrag(full_url)[0]
if not full_url.startswith(base_url):
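
The two helpers imported above can be tried out in isolation. In this sketch, the base URL, the page URL, and the sample links are all hypothetical; the point is to show how relative links are resolved, how fragments are dropped, and how links outside the base URL are recognized:

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical values, for illustration only.
base_url = 'http://www.example.com/crawler/'
page_url = 'http://www.example.com/crawler/index.html'

def normalize(href):
    """Resolve a link relative to the current page and drop its #fragment."""
    full_url = urljoin(page_url, href)
    return urldefrag(full_url)[0]

for href in ['page2.html#top', '/other/', 'http://elsewhere.org/']:
    full_url = normalize(href)
    inside = full_url.startswith(base_url)
    print(full_url, '(followed)' if inside else '(skipped)')
```

Dropping the fragment matters for a crawler because page.html#a and page.html#b refer to the same document and should only be visited once.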












url=url)
except IntegrityError










except IntegrityError as ie:


def store_image(url, img_url, img_file):

db.query('''INSERT INTO images (url, img_url, img_file)
VALUES (:url, :img_url, :img_file)''',
url=url, img_url=img_url, img_file=img_file)
except IntegrityError as ie:
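
The insert-and-catch pattern in this fragment can be tried end to end with Python's built-in sqlite3 module, used here as a stand-in for the database library of the example. The table layout and the uniqueness constraint are assumptions made for this sketch:

```python
import sqlite3

db = sqlite3.connect(':memory:')
# Assumed schema: the pair (url, img_url) must be unique.
db.execute('''CREATE TABLE images (
    url TEXT, img_url TEXT, img_file TEXT,
    PRIMARY KEY (url, img_url))''')

def store_image(url, img_url, img_file):
    """Insert a row; return False when it was already stored."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                '''INSERT INTO images (url, img_url, img_file)
                   VALUES (:url, :img_url, :img_file)''',
                {'url': url, 'img_url': img_url, 'img_file': img_file})
        return True
    except sqlite3.IntegrityError:
        # The uniqueness constraint rejected a duplicate row.
        return False

print(store_image('http://example.com/', 'http://example.com/a.png', 'a.png'))
print(store_image('http://example.com/', 'http://example.com/a.png', 'a.png'))
```

Letting the database enforce uniqueness keeps the crawler simple: it can attempt every insert and treat the integrity error as "already seen."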



def get_random_unvisited_page():






























PART III 


Managerial Concerns 
and Best Practices 


CHAPTER 7 


Managerial and Legal 
Concerns 



7.1 The Data Science Process 




















7.3 Legal Concerns 



the technical barriers it had established to prevent hiQ's use of data. LinkedIn tried to
argue that hiQ Labs violated the 1986 Computer Fraud and Abuse Act by scraping data











CHAPTER 8 


Closing Topics 


overview of other helpful


8.1 Other Tools


8.1.1 Alternative Python Libraries 




8.1.3 Caching 







8.1.4 Proxy Servers





service providing a pool of HTTP proxy servers (see, e.g., https://proxymesh.com) or by
using anonymity services such as Tor (see https://www.torproject.org/), which is free,

of Tor "exit points" and block them. For some solid HTTP proxy server implementations,
take a look at Squid (http://www.squid-cache.org/) or Fiddler (https://www.telerik.com/fiddler).




8.1.6 Command-Line Tools 






















CHAPTER 9 




Examples





Breaking CAPTCHAs Using Deep Learning: This example
CAPTCHAs.






9.1 Scraping Hacker News 



r = requests.get(url)

html_soup = BeautifulSoup(r.text, 'html.parser')


for item in html_soup.find_all('tr', class_='athing'):
    item_a = item.find('a', class_='storylink')






9.2 Using the Hacker News API 

(see https://github.com/HackerNews/API). Let's rework our Python code to now serve
as an API client without relying on Beautiful Soup for HTML parsing:








9.3 Quotes to Scrape 


to store this information in a SQLite database. Instead of using the






















Figure 9-1. Exploring an SQLite database with "DB Browser for SQLite"












9.4 Books to Scrape 





from datetime import datetime

from urllib.parse import urljoin, urlparse


def scrape_books(html_soup, url):

book_url = book.find('h3').find('a').get('href')
book_url = urljoin(url, book_url)
path = urlparse(book_url).path



book_id = path.split('/')[2]
db['books'].upsert({'book_id': book_id,
}, ['book_id'])


book['title'] = main.find('h1').get_text(strip=True)
book['stock'] = main.find(class_='availability').get_text(strip=True)

book['img'] = html_soup.find(class_='thumbnail').find('img').get('src')




.find_next_sibling('p') \


rt to make sure SQLite will accept it 
header = re.sub('[^a-zA-Z]+', '_', header)
value = row.find('td').get_text(strip=True)

db['book_info'].upsert(book, ['book_id'])
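
The way a book identifier is taken from the URL path in the code above can be checked in isolation. The page URL and relative link below are hypothetical and only mimic the shape such a site might use:

```python
from urllib.parse import urljoin, urlparse

# Hypothetical page URL and relative link, for illustration only.
url = 'http://books.example.com/catalogue/page-1.html'
href = 'some-book-title_42/index.html'

book_url = urljoin(url, href)
path = urlparse(book_url).path
parts = path.split('/')
# parts[0] is '' because the path starts with '/'.
book_id = parts[2]

print(book_url)
print(parts)
print(book_id)
```

Because the index depends on the exact depth of the path, a check on len(parts) would be a sensible addition before relying on parts[2] for arbitrary links.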








9.5 Scraping GitHub Stars 










blog JavaScript l 
minecraft-python JavaScript 


mind to not set up this kind of scrapers on a large scale before knowing what you're
getting into. Refer back to the chapter on legal concerns for the details regarding
the legality of scraping.


You will need to create a GitHub profile in case you haven't done so already. Let us
start by getting out the login form from the login page:



url = 'https://github.com/{}'

r = session.get(url.format('login'))
html_soup = BeautifulSoup(r.text, 'html.parser')


.find(id='login')












url = 'https://github.com/{}'

r = session.get(url.format('login'))
html_soup = BeautifulSoup(r.text, 'html.parser')




sword': ")) 




'login': 'YOUR_USER_NAME',







login (y/n): 


Belgium 

macuyikoggmail. com 
































print(r.json())







s.uk/mortgages/' ■* \ 









print(r.json()['header'])








print(mortgages[0])


'ctaType': None, 'uniqueId': '590b357e295b0377d0ti)6c
'howMuchCanBeBorrowedNote': '95% (max) of the value
'initialRateNote': 'until 31st January 2023',


9.7 Scraping and Visualizing IMDB Ratings 



episodes = []
ratings = []


e/tt0944947/episodes' 








































MSOSPWebPartManager_StartWebPartEditingName|false|5|hiddenField|
MSOSPWebPartManager_EndWebPartEditing|false|











'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
              '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'

» 



html_soup = BeautifulSoup(r.text, 'html.parser')
form = html_soup.find(id='aspnetForm')



rt')): 




{'_wpcmWpid': '',

'MSOWebPartPage_PostbackSource
'MSOTlPn_SelectedWpId': '',




'MSOTlPn_ShowSettings': 'False',
'MSOGallery_SelectedLibrary': '',
'MSOGallery_FilterString': '',

'__EVENTTARGET': '',

'__EVENTARGUMENT': '',



I Search by 











'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
              '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36'

» 

r = session.get(url)

html_soup = BeautifulSoup(r.text, 'html.parser')

form = html_soup.find(id='aspnetForm')

for inp in form.find_all(['input', 'select']):
    name = inp.get('name')
    value = inp.get('value')


html_soup = BeautifulSoup(r.text, 'html.parser')
df = pandas.read_html(str(table))




9.9 Scraping and Analyzing Web Forum 
Interactions 





url = 'http://bpbasecamp.freeforums.net/board/27/gear-closet'
print(threads)




















-U matplotlib 




import pickle





if tu not in users[fu]:


    users[fu][tu] = 0
users[fu][tu] += 1
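
The counting pattern in this fragment (a nested dictionary where users[fu][tu] holds the number of interactions from one user to another) can also be written with collections.defaultdict. The sample pairs below are made up for illustration:

```python
from collections import defaultdict

# Hypothetical (from_user, to_user) reply pairs, for illustration only.
interactions = [('ann', 'bob'), ('ann', 'bob'), ('bob', 'ann'), ('ann', 'cid')]

# users[fu][tu] counts how often fu interacted with tu; missing
# entries start at 0 automatically.
users = defaultdict(lambda: defaultdict(int))
for fu, tu in interactions:
    users[fu][tu] += 1

# Convert to plain dicts before storing the result with pickle, since a
# defaultdict built from a lambda cannot be pickled.
plain_users = {fu: dict(targets) for fu, targets in users.items()}
print(plain_users)
```

This removes the explicit "if tu not in users[fu]" check while producing the same counts.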














■!“> 














9.10 Collecting and Clustering a Fashion Data Set 



from urllib.parse import urljoin, urlparse













together with "matplotlib," "scipy," and "numpy," all of which are libraries that are














Image Sizes We're lucky that all of the images we've scraped have the same
width and height. If this were not the case, we'd first have to apply a resizing
to make sure every image leads to a vector in the data set with equal length.


9.11 Sentiment Analysis of Scraped Amazon Reviews 

We're going to scrape a list of Amazon reviews with their ratings for a particular product.
We'll use a book with plenty of reviews, say Learning Python by Mark Lutz, which can be
found at https://www.amazon.com/Learning-Python-5th-Mark-Lutz/dp/1449355730/.
If you click through "See all customer reviews," you'll end up at https://www.amazon.
com/Learning-Python-5th-Mark-Lutz/product-reviews/1449355730/. Note that this













Finally, note that the POST request does not return a full HTML page, but some kind
["script",

'if(window.ue) { ues('id','reviewsAjax1','FE7i80N?iaDZK6Q09S9L');











print(review['rating'], review['title'])
db['reviews'].upsert(review, ['review_id'])


























9.12 Scraping and Analyzing News Articles 











'mozilla/readability/master/Readability.js';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));










'https://raw.githubusercontent.com/mozilla/readability/master/Readability.js'
because its MIME type ('text/plain') is not executable, and strict MIME
type checking is enabled.



































wait.until(EC.presence_of_element_located((By.ID, "readability-script")))


driver.quit()



a JSON-formatted string instead of a raw JavaScript object to Python, as Python will not
be able to make sense of this return value and convert it to a list of None values (simple











return JSON.stringify(article);




script.src = 'http://www.webscrapingfordatascience.com/readability/Readability.js';

d.getElementsByTagName('head')[0].appendChild(script);

}(document));


GET https://www.webscrapingfordatascience.com/readability/Readability.js
net::ERR_CONNECTION_CLOSED







'http://www.webscrapingfordatascience.com/readability/Readability.js'





script_url = 'http://www.webscrapingfordatascience.com/readability/Readability.js'





return JSON.stringify(article);


driver.get(base_url) 
news_urls = []



print('Injecting script') 

returned_result = driver.execute_script(get_article_cmd)
# Do something with returned_result
driver.quit() 
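
Doing something with returned_result usually starts with decoding the JSON string. A minimal sketch, using a made-up string in place of the value the browser would return (the field names are assumptions for this example):

```python
import json

# Hypothetical stand-in for the string produced by JSON.stringify.
returned_result = '{"title": "Example headline", "length": 1234, "byline": null}'

article = json.loads(returned_result)
print(article['title'])
print(article['length'])
print(article['byline'])  # JavaScript null becomes Python None
```

After decoding, the article behaves like any other Python dictionary and can be stored or analyzed further.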



script_url = 'http://www.webscrapingfordatascience.com/readability/Readability.js'



return JSON.stringify(article);























•’» [...1] 



dictionary = corpora.Dictionary([a[1] for a
corpus = [dictionary.doc2bow(a[1]) for a in


»]) 






[(0, 10), (1, 17), (2, 7), (3, 11), [...]]
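
The list of pairs shown here is a "bag of words": each pair holds a token id and the number of times that token occurs in one document. The sketch below rebuilds the idea with the standard library only. It is a simplified stand-in, not the library's own implementation, and the way ids are assigned here is an assumption:

```python
from collections import Counter

def build_dictionary(documents):
    """Assign an integer id to every distinct token, in order of appearance."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical tokenized articles, for illustration only.
documents = [['web', 'scraping', 'web'], ['data', 'science', 'web']]
token2id = build_dictionary(documents)
corpus = [doc2bow(tokens, token2id) for tokens in documents]
print(token2id)
print(corpus)
```

Word order is discarded in this representation; only the counts per token survive, which is what topic models of this kind work with.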



















Scraping Topics There is still a lot of room to improve on this by, for example,
exploring other topic model mapping algorithms, applying better tokenization, adding
custom stop words, or expanding the set of articles or adjusting the parameters.
Alternatively, you might also consider scraping the tags for each article straight from
the Google News page, which also includes these as "topics" on its page.


9.13 Scraping and Analyzing a Wikipedia Graph 

In this example, we'll work once again with Wikipedia (we already used Wikipedia in


from urllib.parse import urljoin, urldefrag

db = dataset.connect('sqlite:///wikipedia.db')
base_url = 'https://en.wikipedia.org/wiki/'

print('Visited page:', url)
print(' title:', title)

db['pages'].upsert({'url': url, 'title': title}, ['url'])


def store_links(from_url, links):
    db.begin()

    db['links'].upsert({'from_url': from_url, 'to_url': to_url},
                       ['from_url', 'to_url'])













The scripti' 











Figure 3-14. 







db = dataset.connect('sqlite:///wikipedia.db')
G = networkx.DiGraph()


print('Building graph...')
for page in db['pages'].all():

    G.add_node(page['url'], title=page['title'])
for link in db['links'].all():

    if G.has_node(link['from_url']) and G.has_node(link['to_url']):
        G.add_edge(link['from_url'], link['to_url'])



print('Calculating betweenness...')

# Sigmoid function to make the colors (a little) more appealing





colors = [(0, 0, squash(betweenness[n])) for n in G.nodes()]
labels = dict((n, d['title']) for n, d in G.nodes(data=True))


plt.text(v[0], v[1]+0.025, s=labels[k],

horizontalalignment='center', size=8)


9.14 Scraping and Visualizing a Board Members 
Graph 






pagenav = soup.find(class_='pageNavigation')










df = pd.read_pickle('sp500.pkl')

G = nx.Graph()

for row in df.itertuples():











CHARACTERS = list('QWERTPASDFGHKLZXBNM')
NR_CAPTCHAS = 1000
NR_CHARACTERS = 4





for i in range(NR_CAPTCHAS):

    captcha = ''.join([choice(CHARACTERS) for c in range(NR_CHARACTERS)])
    filename = os.path.join(CAPTCHA_FOLDER, '{}_{}.png'.format(captcha, i))
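
The text-generation part of this loop can be tried on its own, without rendering any images. The sketch below assumes the character set reads as reconstructed here and uses a seeded generator only so that repeated runs give the same output; the helper name is made up for this example:

```python
from random import Random

CHARACTERS = list('QWERTPASDFGHKLZXBNM')
NR_CHARACTERS = 4

def random_captcha_text(rng):
    """Draw NR_CHARACTERS characters, with replacement, from CHARACTERS."""
    return ''.join(rng.choice(CHARACTERS) for _ in range(NR_CHARACTERS))

rng = Random(0)  # seeded only so that runs are repeatable
for i in range(3):
    captcha = random_captcha_text(rng)
    # Same file-name pattern as in the loop above: text, then index.
    print('{}_{}.png'.format(captcha, i))
```

Encoding the solution in the file name means every generated image carries its own label, which is convenient when the images are later used as training data.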
Figure 9-17. A collection of generated CAPTCHA images














result = result[np.ix_(retain.any(1), retain.any(0))]
cv2.imshow('Final result', result)




, 















if any([l.shape[0] < 10 or l.shape[1] < 10 for l in letters]):
    print('[!] Some of the extracted characters are too small')


















(sanple.dtype, vai.uid, sti(vai.dtype))) 




1665/1665 












CWy-py*): 

import pickle 


lb = pickle.load(f)


model = load_model(MODEL_FILE)


image_files = list(glob(os.path.join(CAPTCHA_FOLDER, '*.png')))




image = cv2.imread(image_file)

















